bail on broken pipe when reading udp streams #509

arlyon · 2024-09-18T11:19:41Z

I have an issue with netavark + aardvark-dns + podman where, under certain conditions (in my case a container running tailscale as an exit node), aardvark-dns ends up stuck burning 100% CPU (and takes down container DNS resolution in the process).

When it happens, aardvark-dns continually prints log lines along the lines of "Error parsing dns message: broken pipe". I can't see anywhere that we bail as a result, so this PR attempts to resolve that. I am hoping that if we break the loop and the socket is closed, it will just reconnect and continue to work as usual.
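To make the suspected failure mode concrete, here is a minimal, std-only sketch of a receive loop. It is not the aardvark-dns code: `drain`, `DrainReport`, and `max_polls` are hypothetical names, and the iterator merely stands in for `receiver.next()`. It assumes the stream keeps yielding the same broken-pipe error on every poll, which is what the repeated log line suggests but has not been confirmed.

```rust
use std::io::{self, ErrorKind};

/// Summary of one run of the loop (hypothetical type, for illustration only).
#[derive(Debug, Default, PartialEq)]
pub struct DrainReport {
    /// How many times the stream was polled.
    pub polls: usize,
    /// How many messages were received successfully.
    pub handled: usize,
    /// Whether the loop stopped because of a broken pipe.
    pub bailed: bool,
}

/// Poll `items` at most `max_polls` times.
///
/// With `bail_on_broken_pipe == false` every error is skipped and the loop
/// keeps polling, which is the assumed cause of the CPU spin. With `true`
/// the loop breaks on the first `BrokenPipe`.
pub fn drain<I>(items: I, max_polls: usize, bail_on_broken_pipe: bool) -> DrainReport
where
    I: IntoIterator<Item = io::Result<Vec<u8>>>,
{
    let mut report = DrainReport::default();
    for item in items.into_iter().take(max_polls) {
        report.polls += 1;
        match item {
            Ok(_msg) => report.handled += 1,
            Err(err) if bail_on_broken_pipe && err.kind() == ErrorKind::BrokenPipe => {
                report.bailed = true;
                break;
            }
            // Any other error is skipped here; a real server would log it.
            Err(_) => {}
        }
    }
    report
}
```

Fed a stream that only ever returns `BrokenPipe`, the non-bailing variant uses up its whole poll budget, while the bailing variant stops after a single poll. The `max_polls` cap exists only so the sketch terminates.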

I am still figuring out how to get this into my version of CoreOS (since it is quite deeply integrated) so that I can try to reproduce it, but I wanted to open this PR in case there is any insight here (or to check whether this line of reasoning is sound).

Additionally, some advice on writing a test for this would be very helpful. I am trying to set up a basic Rust test but it is very involved.

openshift-ci · 2024-09-18T11:19:46Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: arlyon
Once this PR has been reviewed and has the lgtm label, please assign mheon for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

packit-as-a-service · 2024-09-18T11:32:32Z

Tests failed. @containers/packit-build please check.

Luap99 · 2024-09-18T11:33:01Z

src/dns/coredns.rs

@@ -87,6 +87,7 @@ impl CoreDns {
                },
                v = receiver.next() => {
                    let msg_received = match v {
+                        Some(Err(err)) if err.kind() == std::io::ErrorKind::BrokenPipe => {break},


This doesn't look right as you break the loop for tcp and udp and there is no logic to "retry" (open the socket again). At the very least it would need to be removed from the thread_handles map so a new container spawn would allow us to add it again.
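A rough sketch of the bookkeeping this implies, using plain `std::thread` handles rather than whatever task type the real `thread_handles` map stores (an assumption made to keep the example dependency-free). `Listeners`, `reap_finished`, and `ensure` are hypothetical names and do not exist in aardvark-dns.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;
use std::thread::{self, JoinHandle};

/// Hypothetical registry, loosely modelled on the idea of a `thread_handles` map.
pub struct Listeners {
    handles: HashMap<SocketAddr, JoinHandle<()>>,
}

impl Listeners {
    pub fn new() -> Self {
        Self { handles: HashMap::new() }
    }

    /// Remove entries whose thread has already exited (for example because
    /// its receive loop broke out). Returns how many entries were removed.
    pub fn reap_finished(&mut self) -> usize {
        let before = self.handles.len();
        self.handles.retain(|_, handle| !handle.is_finished());
        before - self.handles.len()
    }

    /// Start `serve` for `addr` unless a live listener is already registered.
    /// Returns `true` if a new listener thread was spawned.
    pub fn ensure<F>(&mut self, addr: SocketAddr, serve: F) -> bool
    where
        F: FnOnce() + Send + 'static,
    {
        // Stale entries must go first, otherwise a dead listener would
        // block the address from ever being served again.
        self.reap_finished();
        if self.handles.contains_key(&addr) {
            return false;
        }
        self.handles.insert(addr, thread::spawn(serve));
        true
    }
}
```

The point of the sketch is the ordering in `ensure`: without the reap step, an entry for a listener that has already exited would keep the address permanently "occupied".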

And we likely should check what kind of errors we can get here, and which of those are fatal and which are recoverable in general when reading from the socket. EPIPE just doesn't make any sense to me: it is not documented for recvmsg, which should be the underlying syscall (or does the tokio abstraction do anything weird?).
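One possible shape for such a check is a small classification function. The split below is purely illustrative: which kinds belong in which bucket is exactly the open question, and nothing here reflects what hickory or tokio actually do. `Severity` and `classify` are hypothetical names.

```rust
use std::io::{self, ErrorKind};

/// Whether the receive loop can keep going after an error (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity {
    /// Log the error and keep reading from the same socket.
    Recoverable,
    /// Stop reading; the socket has to be torn down or reopened.
    Fatal,
}

/// Illustrative classification only. The recoverable set is an assumption,
/// not a verified list for UDP or TCP sockets on Linux.
pub fn classify(err: &io::Error) -> Severity {
    match err.kind() {
        ErrorKind::WouldBlock
        | ErrorKind::Interrupted
        | ErrorKind::TimedOut
        | ErrorKind::ConnectionRefused
        | ErrorKind::ConnectionReset
        | ErrorKind::InvalidData => Severity::Recoverable,
        // Everything else, including BrokenPipe, is treated as fatal here.
        _ => Severity::Fatal,
    }
}
```

Defaulting unknown kinds to `Fatal` is a design choice that avoids spinning on an error nobody anticipated, at the cost of tearing down a listener that might have recovered. The opposite default is equally defensible.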

Can you attach strace to it and provide the output of when this happens?

arlyon · 2024-09-18

I will attempt to trigger it again and see what I can do. I have also attached a test that demonstrates the hang when a broken pipe occurs, using just a simple facade.

I agree that simply breaking the loop is not ideal but it at least stops the server from completely hanging.

(also thanks for the quick reply!)

Luap99 · 2024-09-18

Without understanding why this happens it doesn't make much sense to just merge "random" fixes and hope for the best. I don't see what generates EPIPE here, so I would like to see an strace at least, to understand which syscall is returning it. I agree there might be fatal errors that should cause us to abort, but I presume that is neither specific to this error nor specific to UDP?

And if we do that then we must update the thread_handles correctly and I guess try to bind the socket again.
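Rebinding could look roughly like the following bounded retry. This is a sketch using the blocking `std::net::UdpSocket`, not the async socket types the server uses, and `rebind_with_retry` with its `attempts` and `delay` parameters is a hypothetical helper.

```rust
use std::io;
use std::net::{SocketAddr, UdpSocket};
use std::thread;
use std::time::Duration;

/// Try to bind `addr` up to `attempts` times, sleeping `delay` between
/// failed attempts. Returns the last bind error if every attempt fails.
pub fn rebind_with_retry(
    addr: SocketAddr,
    attempts: u32,
    delay: Duration,
) -> io::Result<UdpSocket> {
    // Returned as-is only when `attempts` is 0 and no bind was tried.
    let mut last_err = io::Error::new(
        io::ErrorKind::InvalidInput,
        "attempts must be at least 1",
    );
    for attempt in 0..attempts {
        match UdpSocket::bind(addr) {
            Ok(socket) => return Ok(socket),
            Err(err) => {
                last_err = err;
                // No point sleeping after the final attempt.
                if attempt + 1 < attempts {
                    thread::sleep(delay);
                }
            }
        }
    }
    Err(last_err)
}
```

A bounded number of attempts keeps a permanently unusable address from turning into another busy loop; the caller still has to decide what to do with the map entry when the final attempt fails.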

arlyon · 2024-09-24

I attached strace on my CoreOS instance using a rootful toolbox container and managed to trigger the issue again but, rather stupidly, did not trace child threads / processes. Trying again now.

openshift-merge-robot · 2024-09-27T23:29:42Z

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Luap99

FYI, I looked around a bit and there seem to be some errors that hickory considers hard errors when reading from the socket, but they do not include EPIPE: hickory-dns/hickory-dns@d2e64d8

So I would still like to understand how you can get EPIPE here.

Luap99 · 2024-09-30T09:01:51Z

src/dns/coredns.rs

+    }
+
+    // we need 2 threads or tokio::spawn will block since it never yields
+    #[test_log::test(tokio::test(flavor = "multi_thread", worker_threads = 2))]


What is the point of test_log? It seems this can just be dropped so we do not need extra dependencies.

Luap99 reviewed Sep 18, 2024

arlyon force-pushed the arlyon/break-on-broken-pipe branch 2 times, most recently from a91adc4 to 49dd0b8 Compare September 18, 2024 12:29

arlyon added 2 commits September 18, 2024 13:30

add test

7ac5296

bail on broken pipe when reading udp streams

c6a2fe3

arlyon force-pushed the arlyon/break-on-broken-pipe branch from 49dd0b8 to c6a2fe3 Compare September 18, 2024 12:30

openshift-merge-robot added the needs-rebase label Sep 27, 2024

Luap99 reviewed Sep 30, 2024
